GDAS015-NGS的可视化

NGS的可视化

我们将用以下的包来说明一下如何对NGS数据进行可视化。

1
2
3
4
5
6
#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

我们将会采用四种方式来查看NGS的数据,分别是IGV,plot,Gvizggbio

IGV

将这些文献从R library的目录中复制到当前的工作目标。首先将工作目录设置为源文件位置。我们需要使用Rsamtools包来对BAM文件加上索引,然后使用IGV来查看,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
file.copy(from=fl1,to=basename(fl1))
## [1] FALSE
file.copy(from=fl2,to=basename(fl2))
## [1] FALSE
library(Rsamtools)
## Loading required package: Biostrings
## Loading required package: XVector
indexBam(basename(fl1))
## untreated1_chr4.bam
## "untreated1_chr4.bam.bai"
indexBam(basename(fl2))
## untreated3_chr4.bam
## "untreated3_chr4.bam.bai"

IGV可以在这个网站下载: https://www.broadinstitute.org/igv/home

下载的时候需要提供你的电子邮箱,现在我们使用IGV来查看一个基因,例如igs

如果无法下载IGV,那么可以使用UCSC的 Genome Browser:http://genome.ucsc.edu/cgi-bin/hgTracks

UCSC Genome Browser是一个很好的资源,有许多涉及基因注释,多个物种的保护,以及ENCODE表观遗传信号。但是,UCSC Genome Browser要求你将基因组文件上传到他们的服务器,或将数据放在公共可用服务器上。如果你处理的数据涉及保密,那就另当别论了。

Simple plot

现在我们来看一使用plot()函数来绘制基因。

1
library(GenomicRanges)

注意:如果你使用的是Bioconductor 14,它与R 3.1配合使用,需要加载这个包。如想是Bioconductor 13(它与R 3.0.x配合使用),则不需要加载这个包。

1
2
library(GenomicAlignments)
## Loading required package: SummarizedExperiment

我们从文件fl1中读取比对结果(alignments)。然后我们使用coverage函数来计算碱基对的覆盖。然后我们提取与我们感兴趣的基因重叠的覆盖区子集,并将此覆盖率从RleList转换为numeric向量。我们前面提到,Rle对象是一种压缩对象,例如重复数字会被压缩为2个数字,其中一个是重复的数字,另外一个是表示这个数字重复的次数。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
x <- readGAlignments(fl1)
xcov <- coverage(x)
z <- GRanges("chr4",IRanges(456500,466000))
# Bioconductor 2.14
xcov[z]
## RleList of length 1
## $$chr4
## integer-Rle of length 9501 with 1775 runs
## Lengths: 1252 10 52 4 7 2 ... 10 7 12 392 75 1041
## Values : 0 2 3 4 5 6 ... 3 2 1 0 1 0
# Bioconductor 2.13
xcov$chr4[ranges(z)]
## integer-Rle of length 9501 with 1775 runs
## Lengths: 1252 10 52 4 7 2 ... 10 7 12 392 75 1041
## Values : 0 2 3 4 5 6 ... 3 2 1 0 1 0
xnum <- as.numeric(xcov$chr4[ranges(z)])
plot(xnum)

plot of chunk unnamed-chunk-5

现在我们读取另外一个文件fl2,如下所示:

1
2
3
4
5
6
7
8
9
10
y <- readGAlignmentPairs(fl2)
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 63 pairs (0 proper, 63 not proper) were dropped because the seqname
## or strand of the alignments in the pair were not concordant.
## Note that a GAlignmentPairs object can only hold concordant pairs at the
## moment, that is, pairs where the 2 alignments are on the opposite strands
## of the same reference sequence.
ycov <- coverage(y)
ynum <- as.numeric(ycov$chr4[ranges(z)])
plot(xnum, type="l", col="blue", lwd=2)
lines(ynum, col="red", lwd=2)

plot of chunk unnamed-chunk-6

我们可以单独放大一个外显子,如下所示:

1
2
plot(xnum, type="l", col="blue", lwd=2, xlim=c(6200,6600))
lines(ynum, col="red", lwd=2)

plot of chunk unnamed-chunk-7

使用转录本数据库提取感兴趣的基因

假设我们想可视化基因igs。我们可以从转录本数据库TxDb.Dmelanogaster.UCSC.dm3.ensGene 中提取它,但我们首先要知道这个基因的Ensembl基因名,现在我们使用前面学到的函数来实现这一功能,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# biocLite("biomaRt")
library(biomaRt)
m <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
lf <- listFilters(m)
lf[grep("name", lf$description, ignore.case=TRUE),]
## name
## 1 chromosome_name
## 17 with_flybasename_gene
## 19 with_flybasename_transcript
## 23 with_flybasename_translation
## 52 external_gene_name
## 66 flybasename_gene
## 67 flybasename_transcript
## 68 flybasename_translation
## 93 wikigene_name
## 107 go_parent_name
## 208 so_parent_name
## description
## 1 Chromosome name
## 17 with FlyBaseName gene ID(s)
## 19 with FlyBaseName transcript ID(s)
## 23 with FlyBaseName protein ID(s)
## 52 Associated Gene Name(s) [e.g. BRCA2]
## 66 FlyBaseName Gene ID(s) [e.g. cul-2]
## 67 FlyBaseName Transcript ID(s) [e.g. cul-2-RB]
## 68 FlyBaseName Protein ID(s) [e.g. cul-2-PB]
## 93 WikiGene Name(s) [e.g. Ir21a]
## 107 Parent term name
## 208 Parent term name
map <- getBM(mart = m,
attributes = c("ensembl_gene_id", "flybasename_gene"),
filters = "flybasename_gene",
values = "lgs")
map
## ensembl_gene_id flybasename_gene
## 1 FBgn0039907 lgs

现在我们提取每个基因的外显子,然后提取igs基因的外显子,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
library(GenomicFeatures)
grl <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
gene <- grl[[map$ensembl_gene_id[1]]]
gene
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr4 [457583, 459544] - | 63350 <NA>
## [2] chr4 [459601, 459791] - | 63351 <NA>
## [3] chr4 [460074, 462077] - | 63352 <NA>
## [4] chr4 [462806, 463015] - | 63353 <NA>
## [5] chr4 [463490, 463780] - | 63354 <NA>
## [6] chr4 [463839, 464533] - | 63355 <NA>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

最后,我们绘制出这些外显子的范围,如下所示:

1
2
3
4
5
rg <- range(gene)
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
end(gene),rep(0,length(gene)),
lwd=3, length=.1)

plot of chunk unnamed-chunk-10

不过实际上这个基因位于负链,因此我们需要改一下箭头的方向,如下所示:

1
2
3
4
5
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
end(gene),rep(0,length(gene)),
lwd=3, length=.1,
code=ifelse(as.character(strand(gene)[1]) == "+", 2, 1))

plot of chunk unnamed-chunk-11

Gviz

我们将简要展示两个在Bioconductor中可视化基因组数据的软件包。请注意,这两个包中的任何一个都可以用于绘制多种数据的图形。我们将在这里展示如何像以前一样制作覆盖图(converage plot):

1
2
3
4
5
6
#biocLite("Gviz")
library(Gviz)
## Loading required package: grid
gtrack <- GenomeAxisTrack()
atrack <- AnnotationTrack(gene, name = "Gene Model")
plotTracks(list(gtrack, atrack))

plot of chunk unnamed-chunk-12

提取覆盖。Gviz中的数据最好是是GRanges对象,因此我们需要将RleList转换为GRanges对象,如下所示:

1
2
3
4
5
xgr <- as(xcov, "GRanges")
ygr <- as(ycov, "GRanges")
dtrack1 <- DataTrack(xgr[xgr %over% z], name = "sample 1")
dtrack2 <- DataTrack(ygr[ygr %over% z], name = "sample 2")
plotTracks(list(gtrack, atrack, dtrack1, dtrack2))

plot of chunk unnamed-chunk-13

1
plotTracks(list(gtrack, atrack, dtrack1, dtrack2), type="polygon")

plot of chunk unnamed-chunk-13

ggbio

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#biocLite("ggbio")
library(ggbio)
## Loading required package: ggplot2
## Warning: replacing previous import by 'BiocGenerics::Position' when loading
## 'ggbio'
## Need specific help about ggbio? try mailing
## the maintainer or visit http://tengfei.github.com/ggbio/
##
## Attaching package: 'ggbio'
## The following objects are masked from 'package:ggplot2':
##
## geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
## stat_identity, xlim
autoplot(gene)

plot of chunk unnamed-chunk-14

1
2
3
4
5
autoplot(fl1, which=z)
## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

plot of chunk unnamed-chunk-14

1
2
3
4
5
autoplot(fl2, which=z)
## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

plot of chunk unnamed-chunk-14

参考资料

  1. Visualizing NGS data
  2. IGV: https://www.broadinstitute.org/igv/home
  3. Gviz: http://www.bioconductor.org/packages/release/bioc/html/Gviz.html
  4. ggbio:http://www.bioconductor.org/packages/release/bioc/html/ggbio.html
  5. UCSC Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide: http://genome.ucsc.edu/
  6. Ensembl genome browser: genome databases for vertebrates and other eukaryotic species:http://ensembl.org
  7. Roadmap Epigenome browser: public resource of human epigenomic data: http://www.epigenomebrowser.org http://genomebrowser.wustl.edu/ http://epigenomegateway.wustl.edu/
  8. “Sashimi plots” for RNA-Seq: http://genes.mit.edu/burgelab/miso/docs/sashimi.html
  9. Circos: designed for visualizing genomic data in a cirlce: http://circos.ca/
  10. SeqMonk: a tool to visualise and analyse high throughput mapped sequence data: http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/